
feat: add complete automation infrastructure for plot generation #4

Merged
MarkusNeusinger merged 8 commits into main from
claude/github-issue-poc-plan-01QbVSDkhbSCB2w6qKJe59m5
Nov 24, 2025

Conversation

@MarkusNeusinger
Owner

Summary

Adds complete automation infrastructure for AI-powered plot generation:

  • ✅ Versioned spec template system (v1.0.0)
  • ✅ Code generation with self-review loop
  • ✅ GitHub Actions workflows (@claude triggers for visibility)
  • ✅ Multi-version testing (Python 3.10-3.13)
  • ✅ Preview generation and quality check workflows
  • ✅ Spec upgrade tools (structural + AI-powered)

Workflows

  1. spec-to-code.yml: Posts @claude comment when "approved" label added
  2. claude.yml: Executes generation task with Claude Code
  3. test-and-preview.yml: Runs tests and generates previews (GCS optional)
  4. quality-check.yml: AI evaluation of generated plots (requires GCS)

Testing Plan

Phase 1 (No GCS required):

  • Create test issue with scatter-basic-001 spec
  • Add "approved" label
  • Watch code generation in Claude Code Web
  • Verify PR creation

Phase 2 (After GCS setup):

  • Preview generation and upload
  • Quality evaluation with vision

Related issue: Initial automation infrastructure setup

- Create detailed 13-phase plan for first POC version
- Include manual workflow orchestration via Claude Code
- Cover matplotlib and seaborn implementations
- Define self-review loop, multi-version testing, and preview optimization
- Specify GCS upload, PostgreSQL metadata, API, and minimal frontend
- Provide code examples, scripts, and troubleshooting guide
- Estimate 4-5 hours total implementation time

This plan validates the end-to-end workflow before automating with GitHub Actions.

Core Components:
- Spec template (.template.md) for consistent plot specifications
- Example spec: scatter-basic-001.md
- Plot generator with Claude + versioned rules + self-review loop
- Three GitHub Actions workflows for full automation

Workflows:
1. spec-to-code.yml: Auto-generates code when issue gets 'approved' label
   - Extracts spec from issue
   - Generates matplotlib + seaborn implementations
   - Self-review loop (max 3 attempts)
   - Creates PR automatically

2. test-and-preview.yml: Tests code and generates preview images
   - Multi-version testing (Python 3.10-3.13)
   - Generates preview PNGs
   - Uploads to GCS
   - Comments on PR with preview links

3. quality-check.yml: AI quality evaluation with Claude Vision
   - Downloads previews from GCS
   - Evaluates against spec quality criteria
   - Scores each implementation (0-100)
   - Comments with detailed feedback
   - Adds labels (quality-approved or quality-check-failed)

This infrastructure enables:
- Complete automation from GitHub Issue → Code → Test → Preview → Quality
- Self-review and quality gates built-in
- No manual steps required for plot generation
- Production-ready code output

Next: Test with scatter-basic-001 issue
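The self-review loop described above (generate, review, retry up to three times, feeding review notes back into the next prompt) can be sketched as follows. This is an illustrative sketch, not the actual plot_generator.py code: `generate` and `review` stand in for the real Claude API calls, and the verdict dictionary shape is an assumption.

```python
def generate_with_self_review(generate, review, max_attempts: int = 3):
    """Generate code, ask the reviewer for feedback, retry until approved.

    generate(feedback) returns candidate code (feedback is None on the first try);
    review(code) returns a dict like {"approved": bool, "notes": str}.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)      # prior review notes go into the next prompt
        verdict = review(code)
        if verdict["approved"]:
            return code, attempt
        feedback = verdict["notes"]    # carry reviewer feedback into the retry
    raise RuntimeError(f"Not approved after {max_attempts} attempts")
```

The key design point is that feedback flows forward: each retry sees why the previous attempt was rejected, so the loop converges instead of regenerating blindly.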
Add version tracking to spec files for maintainability and automated upgrades.

Changes:
- Add version marker to spec template (v1.0.0)
- Update scatter-basic-001.md with version info
- Create upgrade_specs.py script for automated spec upgrades
- Add specs/VERSIONING.md documentation
- Update plot_generator.py to check and display spec version

Benefits:
- Easy template evolution without breaking existing specs
- Automated migration when template improves
- Clear tracking of which template version each spec uses
- Version upgrade can be batch-processed

Usage:
  # Check what would change
  python automation/scripts/upgrade_specs.py --dry-run

  # Upgrade all specs to latest version
  python automation/scripts/upgrade_specs.py

  # Upgrade specific spec
  python automation/scripts/upgrade_specs.py --spec scatter-basic-001

This enables continuous improvement of the spec template while maintaining backward compatibility.
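A spec version check like the one upgrade_specs.py and plot_generator.py rely on might look like the sketch below. The HTML-comment marker format is an assumption for illustration; the real template's marker syntax may differ.

```python
import re

# Assumed marker form, e.g.:  <!-- spec-version: 1.0.0 -->
VERSION_RE = re.compile(r"^<!--\s*spec-version:\s*([\d.]+)\s*-->", re.MULTILINE)

def read_spec_version(spec_text: str, default: str = "0.0.0") -> str:
    """Return the template version declared in a spec file, or a default
    for pre-versioning specs that carry no marker."""
    match = VERSION_RE.search(spec_text)
    return match.group(1) if match else default
```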
Add Claude-powered upgrade system for intelligent spec improvements beyond
structural changes.

New Features:
- upgrade_specs_ai.py: AI-powered spec upgrader using Claude
  - Semantic improvements (better wording, clearer criteria)
  - Preserves spec ID and core intent
  - Automatic backup creation (.backup-{version} files)
  - Dry-run mode for preview
  - Single spec or batch upgrade

- specs/upgrades/: Directory for version-specific upgrade instructions
  - Custom instructions per version transition
  - Examples and before/after comparisons
  - Rationale for changes

- Updated VERSIONING.md with AI upgrade documentation

Why AI-Powered Upgrades?
- Structural changes (new sections) are easy with regex
- Semantic improvements (better wording, specificity) need understanding
- AI can reformulate quality criteria to be measurable
- AI can enhance parameter descriptions with types and ranges
- AI preserves intent while improving clarity

Usage:
  export ANTHROPIC_API_KEY=sk-ant-...

  # Preview changes
  python automation/scripts/upgrade_specs_ai.py --dry-run

  # Upgrade all specs (with backups)
  python automation/scripts/upgrade_specs_ai.py --version 1.0.0

  # Upgrade single spec
  python automation/scripts/upgrade_specs_ai.py --spec scatter-basic-001

Combined Strategy:
1. upgrade_specs.py - For structural changes (fast, no API calls)
2. upgrade_specs_ai.py - For semantic improvements (AI-powered)

This enables continuous spec quality improvement as we learn better
documentation patterns.

Replace direct Anthropic API calls with Claude Code Action for AI tasks.
This allows the workflows to use the existing CLAUDE_CODE_OAUTH_TOKEN
instead of requiring a separate ANTHROPIC_API_KEY.

Changes:

spec-to-code.yml:
- ✅ Trigger remains label-based ('approved' label on issue)
- ✅ Spec extraction remains unchanged
- ✅ Code generation now uses Claude Code Action
- ✅ Provides detailed prompt with rules and requirements
- ✅ Claude Code reads specs, rules, and generates implementations
- ✅ Self-review loop integrated into Claude Code execution
- ✅ Automatic commit and PR creation

quality-check.yml:
- ✅ Trigger remains workflow_run based
- ✅ Preview image download from GCS unchanged
- ✅ Quality evaluation now uses Claude Code Action
- ✅ Claude Code views images (Vision) and evaluates against specs
- ✅ Uses gh CLI to comment on PR and add labels
- ✅ Scores each implementation (0-100, ≥85 to pass)

Benefits:
- No separate API key needed
- Uses existing Claude Code Max subscription
- Integrated with Claude Code ecosystem
- Cleaner, more maintainable code
- Claude Code handles commits automatically
- Better error handling and retries

Hybrid Approach:
- Label triggers → Automation starts
- Claude Code → AI-powered tasks (generation, evaluation)
- Regular Actions → Infrastructure (tests, GCS upload, PR creation)

This enables full automation using only CLAUDE_CODE_OAUTH_TOKEN
and GCS_CREDENTIALS secrets.

Make automation workflows visible and trackable in Claude Code Web by using
@claude triggers instead of direct API calls. This allows you to see the
workflow progress live in your Claude Code Web session.

Changes:

1. spec-to-code.yml (Code Generation):
   - Simplified to just trigger @claude with instructions
   - GitHub Action posts @claude comment when 'approved' label is added
   - Claude Code (via claude.yml) picks up the task
   - User sees generation progress in Claude Code Web
   - Claude Code creates implementations, commits, and makes PR

2. quality-check.yml (Quality Evaluation):
   - Posts @claude comment with evaluation instructions
   - Claude Code views preview images and evaluates
   - User sees evaluation progress in Claude Code Web
   - Claude Code posts results and adds labels

3. claude.yml (Core):
   - Enhanced permissions: contents:write, pull-requests:write, issues:write
   - Allows Claude Code to commit and create PRs
   - Fetch full history for better context

Workflow Flow:
1. Issue gets 'approved' label
2. spec-to-code.yml posts @claude comment
3. claude.yml triggers (sees @claude)
4. Claude Code executes in Web (YOU SEE THIS LIVE!)
5. Claude Code commits implementations
6. Claude Code creates PR
7. test-and-preview.yml runs (regular CI)
8. quality-check.yml posts @claude comment
9. Claude Code evaluates (YOU SEE THIS LIVE!)
10. Claude Code posts quality report + labels

Benefits:
- 👁️ Full visibility in Claude Code Web
- 🎮 Can intervene if needed
- 📊 See progress in real-time
- 🔍 Better debugging when things go wrong
- 💬 Natural conversation-based workflow
- 🤖 Same AI throughout (Claude Code Max)
@claude
Contributor

claude bot commented Nov 24, 2025

Pull Request Review: Complete Automation Infrastructure

🎯 Overview

This PR adds comprehensive automation infrastructure for AI-powered plot generation. The approach is well-architected and aligns with the project's specification-first philosophy. Here's my detailed review:


✅ Strengths

Architecture & Design

  • Excellent workflow orchestration: The 4-workflow chain (spec-to-code → claude → test-and-preview → quality-check) is well thought out
  • Clean separation of concerns: Each workflow has a clear, focused purpose
  • Multi-version testing: Python 3.10-3.13 matrix testing ensures broad compatibility
  • Self-review loop: The plot_generator.py implements iterative refinement (max 3 attempts)
  • Versioned rules system: Enables A/B testing and rollback capabilities

Code Quality

  • Type hints: Proper use of Literal types and return type annotations in plot_generator.py
  • Error handling: Good use of try-except blocks with clear error messages
  • Documentation: Comprehensive docstrings throughout
  • Modular design: Clear separation between generation, validation, and file operations

🔴 Critical Issues

1. Security: API Key Exposure Risk (plot_generator.py:169-173)

Location: automation/generators/plot_generator.py:89-93

api_key = os.getenv("ANTHROPIC_API_KEY")
if not api_key:
    raise ValueError("ANTHROPIC_API_KEY environment variable not set")

Issue: While the code correctly uses environment variables, the model ID is hardcoded:

model="claude-sonnet-4-20250514"

Concerns:

  • This model ID may not exist or may be deprecated
  • Hardcoding prevents easy model switching
  • No API rate limiting or error handling for API calls

Recommendation:

model = os.getenv("CLAUDE_MODEL", "claude-sonnet-4-20250514")
# Add retry logic with exponential backoff
from anthropic import APIError, RateLimitError

try:
    response = client.messages.create(...)
except RateLimitError as e:
    ...  # handle rate limiting, e.g. back off and retry
except APIError as e:
    ...  # handle other API errors

2. Workflow Artifact Passing Issue (quality-check.yml:25-30)

Location: .github/workflows/quality-check.yml:25-30

- name: Download preview metadata
  uses: actions/download-artifact@v4
  with:
    name: preview-metadata
    run-id: ${{ github.event.workflow_run.id }}
    github-token: ${{ secrets.GITHUB_TOKEN }}

Issue: The workflow_run event has known limitations with artifact downloading across workflow boundaries. Artifacts from the triggering workflow may not be accessible.

Recommendation: Use a persistent storage mechanism (e.g., commit a file to the repo, or use GitHub cache) or restructure to use a single workflow with multiple jobs.

3. Code Injection Vulnerability (test-and-preview.yml:111)

Location: .github/workflows/test-and-preview.yml:111

MPLBACKEND=Agg uv run python "$file" 2>&1 | tee "preview_outputs/${SPEC_ID}_${LIBRARY}_${VARIANT}.log"

Issue: Directly executing Python files from user-submitted PRs without sandboxing or validation is a major security risk. Malicious code could:

  • Exfiltrate secrets
  • Modify repository contents
  • Access GitHub Actions environment

Critical Recommendation:

  1. Run in a sandboxed environment (Docker container with no network access)
  2. Use resource limits (CPU, memory, time)
  3. Only execute from trusted branches after review
  4. Consider using PyPy or RestrictedPython
- name: Generate preview images
  run: |
    docker run --rm --network none --cpus="1.0" --memory="512m" \
      -v $(pwd):/workspace -w /workspace \
      python:3.12-slim bash -c "MPLBACKEND=Agg python $file"

⚠️ High Priority Issues

4. Fragile File Path Parsing (test-and-preview.yml:103-105, quality-check.yml:86)

LIBRARY=$(echo "$file" | cut -d'/' -f2)
SPEC_ID=$(echo "$file" | cut -d'/' -f4)
VARIANT=$(basename "$file" .py)

Issue: Hardcoded field positions will break if directory structure changes. No validation that the path matches the expected format.

Recommendation:

# Use regex with validation
if [[ $file =~ ^plots/([^/]+)/([^/]+)/([^/]+)/([^/]+)\.py$ ]]; then
  LIBRARY="${BASH_REMATCH[1]}"
  PLOT_TYPE="${BASH_REMATCH[2]}"
  SPEC_ID="${BASH_REMATCH[3]}"
  VARIANT="${BASH_REMATCH[4]}"
else
  echo "Invalid file path format: $file"
  exit 1
fi

5. Missing Error Handling (plot_generator.py:177-181)

if "```python" in code:
    code = code.split("```python")[1].split("```")[0].strip()

Issues:

  • No validation that code was successfully extracted
  • Will fail silently if format is unexpected
  • No syntax validation before saving

Recommendation:

import ast

def extract_code_from_response(text: str) -> str:
    if "```python" in text:
        code = text.split("```python")[1].split("```")[0].strip()
    else:
        code = text.strip()
    
    # Validate syntax
    try:
        ast.parse(code)
    except SyntaxError as e:
        raise ValueError(f"Generated code has syntax errors: {e}")
    
    return code

6. Hardcoded GCS Assumption (test-and-preview.yml:128-140)

The workflow assumes GCS is always available. The if: steps.changed_plots.outputs.changed_files != '' condition doesn't account for missing GCS credentials.

Recommendation:

- name: Setup Google Cloud authentication
  if: steps.changed_plots.outputs.changed_files != '' && secrets.GCS_CREDENTIALS != ''
  continue-on-error: true  # Don't fail if GCS unavailable
  id: gcs_auth
  uses: google-github-actions/auth@v2
  with:
    credentials_json: ${{ secrets.GCS_CREDENTIALS }}

- name: Upload previews to GCS
  if: steps.gcs_auth.outcome == 'success'
  # ...

7. Spec ID Extraction Fragility (spec-to-code.yml:22-27)

SPEC_ID=$(echo "$ISSUE_TITLE" | grep -oP '^[a-z]+-[a-z]+-\d{3}' || echo "")

Issues:

  • Only allows lowercase letters (fails for Scatter-basic-001)
  • Requires exactly 3 digits (fails for scatter-basic-1 or scatter-basic-1234)
  • No validation of spec file existence

Recommendation:

SPEC_ID=$(echo "$ISSUE_TITLE" | grep -oiP '^[a-z]+-[a-z]+-\d{3,4}' | tr '[:upper:]' '[:lower:]' || echo "")

if [ -n "$SPEC_ID" ] && [ ! -f "specs/${SPEC_ID}.md" ]; then
  echo "⚠️  Warning: Spec file specs/${SPEC_ID}.md does not exist"
fi

🟡 Medium Priority Issues

8. Incomplete Test Coverage Path

- name: Run tests
  run: |
    uv run pytest tests/ -v --cov=plots --cov-report=xml --cov-report=term

Issue: Coverage is only measured for plots/ directory, but the PR adds code in automation/ that should also be tested.

Recommendation:

uv run pytest tests/ -v --cov=plots --cov=automation --cov-report=xml --cov-report=term

9. Magic Numbers & Hardcoded Values

  • max_tokens=4000 (line 171) - Why 4000? Should be configurable
  • max_tokens=2000 (line 219) - Different for review?
  • max_attempts: int = 3 - Should be configurable via environment variable
  • Plot type mapping (lines 100-104) - Incomplete and will fail for new libraries
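One way to make these values configurable, as the review suggests, is a small environment-variable helper. This is a sketch under stated assumptions: the PLOTGEN_* variable names are hypothetical, not from the actual codebase.

```python
import os

def int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.getenv(name, default))

# Hypothetical names; the current hardcoded values become the defaults.
MAX_TOKENS_GENERATION = int_env("PLOTGEN_MAX_TOKENS", 4000)
MAX_TOKENS_REVIEW = int_env("PLOTGEN_REVIEW_MAX_TOKENS", 2000)
MAX_ATTEMPTS = int_env("PLOTGEN_MAX_ATTEMPTS", 3)
```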

10. Missing Logging

The plot_generator.py uses print statements instead of proper logging:

print(f"📋 Spec version: {spec_version}")
print(f"🔄 Attempt {attempt}/{max_attempts}")

Recommendation: Use Python's logging module for structured logging with levels (DEBUG, INFO, WARNING, ERROR).

11. Preview File Name Assumptions (test-and-preview.yml:114-125)

if [ -f "test_output.png" ]; then
  mv test_output.png "$OUTPUT_FILE"
elif [ -f "output.png" ]; then
  mv output.png "$OUTPUT_FILE"
elif [ -f "${SPEC_ID}.png" ]; then
  mv "${SPEC_ID}.png" "$OUTPUT_FILE"

Issue: This assumes plot scripts save to these specific filenames. Better to enforce a convention via generation rules.

Recommendation: Require all plot scripts to save to a predictable location specified in an environment variable.


💡 Suggestions & Best Practices

12. Add Integration Tests

The PR adds workflows but no tests for them. Consider:

  • Testing plot_generator.py with mock API responses
  • Testing spec upgrade scripts
  • Testing path parsing logic in a unit test

13. Add Rate Limiting Protection

import time
from functools import wraps

def rate_limit(calls_per_minute: int):
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_minute=5)
def call_claude_api(...):
    ...

14. Add Telemetry

Track success/failure rates, generation times, review attempts, etc. to monitor system health.

15. Version String Comparison Issue (upgrade_specs.py:76)

if current_version < "1.0.0" and target_version >= "1.0.0":

Issue: String comparison doesn't work correctly for semantic versions ("1.10.0" < "1.2.0" as strings).

Recommendation: Use packaging.version.Version for proper semantic version comparison.
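The pitfall and a dependency-free fix can be shown in a few lines. The review's actual recommendation is packaging.version.Version (which also handles pre-releases); the tuple sketch below covers only plain x.y.z versions.

```python
def version_key(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so comparison is numeric per component."""
    return tuple(int(part) for part in version.split("."))

# String comparison is lexicographic and gets semantic ordering wrong:
assert ("1.10.0" < "1.2.0") is True          # wrong semantically
# Numeric tuples order correctly:
assert version_key("1.10.0") > version_key("1.2.0")
```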

16. Add Dry-Run Mode to Workflows

Allow testing workflow logic without actual execution (e.g., workflow_dispatch with input parameter).


📋 Checklist for Merge

Must Fix Before Merge:

Should Fix Before Merge:

Nice to Have:

  • Add integration tests for plot_generator.py
  • Replace print statements with logging
  • Add rate limiting for API calls
  • Make configuration values (max_tokens, max_attempts) configurable

🎯 Overall Assessment

Quality Score: 75/100

Verdict: ⚠️ NEEDS IMPROVEMENTS before merge

Breakdown:

  • Architecture: ⭐⭐⭐⭐⭐ (5/5) - Excellent design
  • Security: ⭐⭐ (2/5) - Critical code execution vulnerability
  • Code Quality: ⭐⭐⭐⭐ (4/5) - Good but needs error handling
  • Testing: ⭐⭐⭐ (3/5) - Missing tests for new code
  • Documentation: ⭐⭐⭐⭐ (4/5) - Well documented

Summary

This is a solid foundation with excellent architectural design. The workflow orchestration and self-review loop are particularly well thought out. However, the critical security issue with executing untrusted code must be addressed before merging. The other issues around error handling, path parsing, and artifact passing should also be fixed for production readiness.

The code demonstrates strong understanding of the problem domain and follows project conventions well. With the security fixes and improved error handling, this will be a valuable addition to the project.


🚀 Recommendations

  1. Immediate: Address security issue #3 (sandboxing)
  2. Short term: Fix critical path parsing and error handling issues
  3. Before production: Add comprehensive integration tests and monitoring
  4. Future: Consider adding metrics/observability for generation quality tracking

Great work on the automation infrastructure! Let me know if you'd like me to help with any of the fixes.

@claude
Contributor

claude bot commented Nov 24, 2025

Pull Request Review: Complete Automation Infrastructure

🎯 Overview

This PR adds comprehensive automation infrastructure for AI-powered plot generation. The approach is well-architected and aligns with the project's specification-first philosophy. Here's my detailed review:


✅ Strengths

Architecture & Design

  • Excellent workflow orchestration: The 4-workflow chain (spec-to-code → claude → test-and-preview → quality-check) is well thought out
  • Clean separation of concerns: Each workflow has a clear, focused purpose
  • Multi-version testing: Python 3.10-3.13 matrix testing ensures broad compatibility
  • Self-review loop: The plot_generator.py implements iterative refinement (max 3 attempts)
  • Versioned rules system: Enables A/B testing and rollback capabilities

Code Quality

  • Type hints: Proper use of Literal types and return type annotations in plot_generator.py
  • Error handling: Good use of try-except blocks with clear error messages
  • Documentation: Comprehensive docstrings throughout
  • Modular design: Clear separation between generation, validation, and file operations

🔴 Critical Issues

1. Security: API Key Exposure Risk (plot_generator.py:169-173)

Location: automation/generators/plot_generator.py:89-93

While the code correctly uses environment variables, the model ID is hardcoded. This prevents easy model switching and has no API rate limiting or error handling for API calls.

Recommendation: Make model configurable and add retry logic with exponential backoff for rate limiting and API errors.

2. Workflow Artifact Passing Issue (quality-check.yml:25-30)

Location: .github/workflows/quality-check.yml:25-30

The workflow_run event has known limitations with artifact downloading across workflow boundaries. Artifacts from the triggering workflow may not be accessible.

Recommendation: Use a persistent storage mechanism (e.g., commit a file to the repo, or use GitHub cache) or restructure to use a single workflow with multiple jobs.

3. Code Injection Vulnerability (test-and-preview.yml:111)

Location: .github/workflows/test-and-preview.yml:111

MPLBACKEND=Agg uv run python "$file" 2>&1 | tee "preview_outputs/${SPEC_ID}_${LIBRARY}_${VARIANT}.log"

Issue: Directly executing Python files from user-submitted PRs without sandboxing or validation is a major security risk. Malicious code could:

  • Exfiltrate secrets
  • Modify repository contents
  • Access GitHub Actions environment

Critical Recommendation: Run in a sandboxed environment (Docker container with no network access) with resource limits (CPU, memory, time), and only execute from trusted branches after review.


⚠️ High Priority Issues

4. Fragile File Path Parsing (test-and-preview.yml:103-105, quality-check.yml:86)

LIBRARY=$(echo "$file" | cut -d'/' -f2)
SPEC_ID=$(echo "$file" | cut -d'/' -f4)
VARIANT=$(basename "$file" .py)

Issue: Hardcoded field positions will break if directory structure changes. No validation that the path matches the expected format.

Recommendation: Use regex with validation to ensure path matches expected format before parsing.

5. Missing Error Handling (plot_generator.py:177-181)

Code extraction from Claude responses has no validation that code was successfully extracted, will fail silently if format is unexpected, and has no syntax validation before saving.

Recommendation: Add Python AST parsing to validate syntax before saving generated code.

6. Hardcoded GCS Assumption (test-and-preview.yml:128-140)

The workflow assumes GCS is always available. The condition doesn't account for missing GCS credentials.

Recommendation: Add proper conditional logic to gracefully handle missing GCS credentials with continue-on-error: true.

7. Spec ID Extraction Fragility (spec-to-code.yml:22-27)

Issues:

  • Only allows lowercase letters (fails for Scatter-basic-001)
  • Requires exactly 3 digits (fails for scatter-basic-1 or scatter-basic-1234)
  • No validation of spec file existence

Recommendation: Use case-insensitive regex and validate spec file exists before proceeding.


🟡 Medium Priority Issues

8. Incomplete Test Coverage Path

Coverage is only measured for plots/ directory, but the PR adds code in automation/ that should also be tested.

Recommendation: Add --cov=automation to pytest command.

9. Magic Numbers & Hardcoded Values

  • max_tokens=4000 (line 171) - Why 4000? Should be configurable
  • max_tokens=2000 (line 219) - Different for review?
  • max_attempts: int = 3 - Should be configurable via environment variable
  • Plot type mapping (lines 100-104) - Incomplete and will fail for new libraries

10. Missing Logging

The plot_generator.py uses print statements instead of proper logging. Use Python's logging module for structured logging with levels (DEBUG, INFO, WARNING, ERROR).

11. Preview File Name Assumptions (test-and-preview.yml:114-125)

Assumes plot scripts save to specific filenames. Better to enforce a convention via generation rules and use environment variables.


💡 Suggestions & Best Practices

12. Add Integration Tests

The PR adds workflows but no tests for them. Consider:

  • Testing plot_generator.py with mock API responses
  • Testing spec upgrade scripts
  • Testing path parsing logic in a unit test

13. Add Rate Limiting Protection

Implement rate limiting decorator for Claude API calls to prevent hitting rate limits.

14. Add Telemetry

Track success/failure rates, generation times, review attempts, etc. to monitor system health.

15. Version String Comparison Issue (upgrade_specs.py:76)

if current_version < "1.0.0" and target_version >= "1.0.0":

Issue: String comparison doesn't work correctly for semantic versions ("1.10.0" < "1.2.0" as strings).

Recommendation: Use packaging.version.Version for proper semantic version comparison.

16. Add Dry-Run Mode to Workflows

Allow testing workflow logic without actual execution (e.g., workflow_dispatch with input parameter).


📋 Checklist for Merge

Must Fix Before Merge:

Should Fix Before Merge:

Nice to Have:

  • Add integration tests for plot_generator.py
  • Replace print statements with logging
  • Add rate limiting for API calls
  • Make configuration values (max_tokens, max_attempts) configurable

🎯 Overall Assessment

Quality Score: 75/100

Verdict: ⚠️ NEEDS IMPROVEMENTS before merge

Breakdown:

  • Architecture: ⭐⭐⭐⭐⭐ (5/5) - Excellent design
  • Security: ⭐⭐ (2/5) - Critical code execution vulnerability
  • Code Quality: ⭐⭐⭐⭐ (4/5) - Good but needs error handling
  • Testing: ⭐⭐⭐ (3/5) - Missing tests for new code
  • Documentation: ⭐⭐⭐⭐ (4/5) - Well documented

Summary

This is a solid foundation with excellent architectural design. The workflow orchestration and self-review loop are particularly well thought out. However, the critical security issue with executing untrusted code must be addressed before merging. The other issues around error handling, path parsing, and artifact passing should also be fixed for production readiness.

The code demonstrates strong understanding of the problem domain and follows project conventions well. With the security fixes and improved error handling, this will be a valuable addition to the project.


🚀 Recommendations

  1. Immediate: Address security issue #3 (sandboxing)
  2. Short term: Fix critical path parsing and error handling issues
  3. Before production: Add comprehensive integration tests and monitoring
  4. Future: Consider adding metrics/observability for generation quality tracking

Great work on the automation infrastructure! The architecture is sound and aligns well with the project's goals. Focus on addressing the security concerns and error handling, and this will be production-ready.

@MarkusNeusinger MarkusNeusinger self-assigned this Nov 24, 2025
Addresses critical review feedback with 5 key improvements:

1. GCS Optional: Make GCS upload optional with proper error handling
   - Add continue-on-error and conditional checks
   - Workflows work without GCS credentials

2. Spec ID Extraction: Case-insensitive, flexible digit count
   - Support 3-4 digits (001-9999)
   - Convert to lowercase automatically
   - Validate spec file existence

3. Code Syntax Validation: AST-based validation before saving
   - New extract_and_validate_code() function
   - Catches syntax errors early with retry
   - Clear error messages

4. Path Parsing: Regex-based validation with error handling
   - Replace fragile cut-based parsing
   - Validate format: plots/{lib}/{type}/{spec}/{variant}.py
   - Skip invalid files with warning

5. API Error Handling: Retry with exponential backoff
   - Retry on RateLimitError and APIConnectionError
   - 3 retries with 2s, 4s, 8s delays
   - Don't retry on other API errors

Ready for initial testing without GCS setup.
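The retry behaviour described in item 5 (3 retries, 2s/4s/8s delays, retrying only on rate-limit and connection errors) can be sketched as below. The real implementation lives in plot_generator.py; RetryableError is a stand-in for anthropic's RateLimitError / APIConnectionError, and the injectable sleep parameter is an addition for testability.

```python
import time

class RetryableError(Exception):
    """Stand-in for anthropic.RateLimitError / anthropic.APIConnectionError."""

def call_with_backoff(call, retries: int = 3, base_delay: float = 2.0,
                      sleep=time.sleep):
    """Invoke call(); on a retryable error, wait 2s, 4s, 8s before retrying.
    Non-retryable exceptions propagate immediately."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RetryableError:
            if attempt == retries:
                raise                      # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))   # exponential backoff
```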
@claude
Contributor

claude bot commented Nov 24, 2025

🔍 Code Review

Thank you for this comprehensive PR adding automation infrastructure for plot generation! This is a well-structured implementation with good separation of concerns. Below is my detailed review:


✅ Strengths

  1. Excellent Architecture

    • Clean separation between spec generation, code generation, testing, and quality checks
    • Well-designed workflow chaining (spec-to-code → test-and-preview → quality-check)
    • Proper use of versioned rules system for maintainability
  2. Good Error Handling

    • Retry logic with exponential backoff in plot_generator.py:57-101
    • Proper error messages and exit codes
    • Validation of API keys and file paths
  3. Security Best Practices

    • Secrets properly managed via GitHub Secrets (not hardcoded)
    • GCS credentials handled via OIDC with id-token: write
    • Appropriate use of continue-on-error for optional GCS upload

🐛 Potential Bugs & Issues

High Priority

  1. Spec ID Extraction May Miss Valid IDs (spec-to-code.yml:16-44)

    • Current regex: [a-z]+-[a-z]+-\d{3,4} only allows lowercase
    • But the pattern allows "3-4 digits" which could match IDs like scatter-basic-0001
    • Issue: Inconsistent with docs stating {001-9999} format
    • Recommendation: Standardize to exactly 3 digits: [a-z]+-[a-z]+-\d{3}
  2. Checkout Missing in spec-to-code.yml

    • The workflow extracts spec ID and validates file existence (line 38) but never checks out the repository
    • Fix: Add checkout step before the extract step:
    - name: Checkout repository
      uses: actions/checkout@v4
  3. Unsafe Variable Reference in Quality Check (quality-check.yml:64-66)

    • JavaScript template literal uses unescaped variable in string interpolation
    • While unlikely to be exploited here, it's better to use proper escaping
    • Recommendation: Pass as environment variables instead
  4. Code Extraction Logic May Fail (plot_generator.py:37-40)

    • Splits on triple backticks which could fail if there are multiple code blocks
    • Better approach: Use more robust parsing with proper indexing

Medium Priority

  1. Preview Image Name Parsing Bug (test-and-preview.yml:164-167)

    • cut -d'_' -f1-3 assumes spec_id has exactly 3 parts separated by dashes
    • For scatter-basic-001_matplotlib_default.png, this would produce scatter-basic-001
    • But if spec IDs ever have more dashes, this breaks
    • Recommendation: Parse filename using bash regex matching instead
  2. Missing Error Handling for jq Failures (test-and-preview.yml:88)

    • If jq fails, the while loop silently continues
    • Recommendation: Add set -e at start of bash scripts or check jq exit code
  3. Hardcoded Model Name (plot_generator.py:256, 309)

    • Model claude-sonnet-4-20250514 is hardcoded
    • Recommendation: Make configurable via environment variable with this as default

Low Priority

  1. Version String Comparison May Fail (upgrade_specs.py:76)

    • Uses string comparison current_version < "1.0.0"
    • Works for simple versions but breaks for "1.10.0" < "1.9.0" (string comparison)
    • Recommendation: Use proper semantic versioning library like packaging.version.parse()
  2. Potential Race Condition in Preview Generation (test-and-preview.yml:121-132)

    • Checks for hardcoded filenames (test_output.png, output.png, etc.)
    • If multiple previews run in parallel (they don't currently), they could collide
    • Not a current issue since it's sequential, but worth noting

🚀 Performance Considerations

  1. Workflow Chaining Overhead

    • quality-check.yml waits for test-and-preview.yml to fully complete
    • Consider using workflow artifacts more efficiently or running quality checks in parallel with tests
  2. Multi-Python Testing

    • Testing on 4 Python versions (3.10-3.13) is thorough but adds ~4x time
    • Optimization: Consider testing 3.10 and 3.13 only on PRs, full matrix on main branch
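An untested sketch of one way to express that split, using the `&&`/`||` ternary idiom in a GitHub Actions expression (full matrix on main, endpoints only on everything else):

```yaml
strategy:
  matrix:
    # Full Python matrix on main, oldest + newest only on PR builds
    python-version: ${{ github.ref == 'refs/heads/main' && fromJSON('["3.10", "3.11", "3.12", "3.13"]') || fromJSON('["3.10", "3.13"]') }}
```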
  3. API Token Usage

    • Each generation attempt uses 2 API calls (generation + self-review)
    • With a max of 3 attempts, that is up to 6 API calls per implementation
    • Cost consideration: Monitor token usage; consider reducing max_attempts to 2
  4. GCS Download/Upload

    • Downloads entire preview directory with gsutil -m cp -r
    • For many files, this could be slow
    • Current scale: Not an issue yet, but consider pagination if >100 files

🔒 Security Concerns

  1. Command Injection Risk (Low) (test-and-preview.yml:98-132)

    • Uses the $file variable from git output in bash commands
    • While git output is trusted, it's good practice to quote such variables
    • Status: The invocation already quotes it (uv run python "$file" ✅), so no change is needed
  2. Secrets Exposure in Logs

    • GCS_BUCKET is used in URLs (quality-check.yml:65)
    • If bucket name is sensitive, this exposes it in comments
    • Recommendation: Document that bucket name is not considered secret
  3. Untrusted Code Execution (test-and-preview.yml:118)

    • Executes generated Python code with uv run python "$file"
    • This is by design but worth noting: malicious specs could generate harmful code
    • Mitigation: Run in isolated GH Actions environment (✅ already done)

🧪 Test Coverage

Missing Tests

  1. No Unit Tests for plot_generator.py

    • Critical component with no test coverage
    • Recommendation: Add tests in tests/unit/automation/test_plot_generator.py:
      • Test extract_and_validate_code() with various inputs
      • Mock API calls to test retry logic
      • Test error handling for missing files
  2. No Tests for Upgrade Scripts

    • upgrade_specs.py and upgrade_specs_ai.py lack tests
    • Recommendation: Test version detection, upgrade logic, backup creation
  3. No Workflow Tests

    • Consider adding workflow validation with actionlint
    • Check for common YAML/workflow issues

Test Structure Recommendations

````python
# tests/unit/automation/generators/test_plot_generator.py
import pytest

# Adjust this import to the actual module path of plot_generator.py
# from automation.generators.plot_generator import extract_and_validate_code


def test_extract_and_validate_code_success():
    response = '''Here's the code:
    ```python
    def create_plot():
        pass
    ```'''
    code = extract_and_validate_code(response)
    assert "def create_plot" in code


def test_extract_and_validate_code_syntax_error():
    with pytest.raises(ValueError, match="syntax errors"):
        extract_and_validate_code("def invalid syntax")
````
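The retry logic can be tested the same way by mocking the API client. The harness below is a self-contained stand-in for the generator's loop (the real signature and client attributes in plot_generator.py may differ): the first mocked response is unusable, the second is valid, and the test asserts the loop recovers after exactly two calls.

```python
from unittest.mock import MagicMock


def generate_with_retry(client, extract, max_attempts=3):
    """Stand-in for plot_generator's retry loop (real signature may differ)."""
    last_error = None
    for _ in range(max_attempts):
        response = client.messages.create(model="claude-sonnet-4-20250514", messages=[])
        try:
            return extract(response.content[0].text)
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"all {max_attempts} attempts failed: {last_error}")


def test_retry_recovers_after_bad_response():
    bad = MagicMock(content=[MagicMock(text="no code block here")])
    good = MagicMock(content=[MagicMock(text="def create_plot(): pass")])
    client = MagicMock()
    client.messages.create.side_effect = [bad, good]

    def extract(text):
        # Mimics extract_and_validate_code raising on unusable responses
        if "def create_plot" not in text:
            raise ValueError("syntax errors")
        return text

    assert "def create_plot" in generate_with_retry(client, extract)
    assert client.messages.create.call_count == 2
```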

📝 Code Quality Suggestions

  1. Add Type Hints to upgrade_specs.py

    • Line 76: the current_version < "1.0.0" comparison lacks type safety
    • Import Optional from typing for values that may be None
  2. Improve Docstrings

    • plot_generator.py has good docstrings ✅
    • upgrade_specs.py could use more detailed examples in docstrings
  3. Extract Magic Strings

    • Multiple workflows reference "claude-sonnet-4-20250514"
    • Consider centralizing in a config file or environment variable
  4. Consider Adding Logging

    • plot_generator.py uses print() statements
    • For production, consider using Python logging module for better control
  5. Shell Script Best Practices

    • Add set -euo pipefail at the start of bash blocks for better error handling
    • Example:
```yaml
run: |
  set -euo pipefail
  # rest of script
```

🎯 Recommendations Summary

Must Fix Before Merge

  1. ✅ Add checkout step to spec-to-code.yml
  2. ✅ Fix preview image name parsing in test-and-preview.yml:164-167
  3. ✅ Add basic unit tests for plot_generator.py

Should Fix Soon

  1. Standardize spec ID format (3 vs 4 digits)
  2. Improve version string comparison in upgrade scripts
  3. Add proper error handling for jq failures

Nice to Have

  1. Add workflow validation with actionlint
  2. Consider making model name configurable
  3. Add logging instead of print statements
  4. Extract magic strings to config

📊 Overall Assessment

Code Quality: ⭐⭐⭐⭐☆ (4/5) - Well structured with good patterns, minor issues

Security: ⭐⭐⭐⭐⭐ (5/5) - Proper secret management, appropriate isolation

Test Coverage: ⭐⭐☆☆☆ (2/5) - Missing unit tests for critical components

Documentation: ⭐⭐⭐⭐☆ (4/5) - Good inline docs, comprehensive CLAUDE.md

Performance: ⭐⭐⭐⭐☆ (4/5) - Reasonable for current scale, some optimization opportunities


✅ Conclusion

This is a solid foundation for the automation infrastructure! The architecture is well-designed and follows GitHub Actions best practices. The main concern is the lack of unit tests for critical components like plot_generator.py.

Recommendation: After addressing the "Must Fix" items (especially the missing checkout step and adding basic tests), this PR will be ready to merge.

Great work on the comprehensive automation setup! 🎉

@MarkusNeusinger MarkusNeusinger merged commit 7ca4f26 into main Nov 24, 2025
2 of 6 checks passed
@MarkusNeusinger MarkusNeusinger deleted the claude/github-issue-poc-plan-01QbVSDkhbSCB2w6qKJe59m5 branch November 24, 2025 12:59
MarkusNeusinger added a commit that referenced this pull request Dec 1, 2025
…an-01QbVSDkhbSCB2w6qKJe59m5

feat: add complete automation infrastructure for plot generation
MarkusNeusinger added a commit that referenced this pull request Dec 16, 2025
#1089)

Fixes #967

## Summary
- Update quality threshold documentation from 85 to 90
- Add CLI retry logic for intermittent failures
- Change issue lifecycle: spec-ready issues stay open until all
implementations complete

## Changes

### Documentation (Issue Finding #4)
Updates quality threshold from 85 to 90 in documentation to match actual
workflow configuration in `impl-review.yml`.

**Files:**
- CLAUDE.md
- README.md
- docs/workflow.md
- docs/concepts/claude-skill-plot-generation.md
- prompts/quality-evaluator.md

### CLI Retry Logic (Issue Finding #3)
Adds retry mechanism for Claude CLI steps to handle intermittent
"Executable not found in $PATH" errors. Each Claude step now has
`continue-on-error: true` and a retry step that runs if the first
attempt fails.

**Files:**
- .github/workflows/spec-create.yml
- .github/workflows/spec-update.yml
- .github/workflows/impl-generate.yml
- .github/workflows/impl-repair.yml
- .github/workflows/util-claude.yml

### Issue Lifecycle (Bonus)
- `spec-ready` issues now stay open until all 9 library implementations
are merged
- Changed `Closes #...` to `Related to #...` in spec PR body
- Added auto-close logic in `impl-merge.yml` when all
`impl:{library}:done` labels are present

**Files:**
- .github/workflows/spec-create.yml
- .github/workflows/impl-merge.yml

## Test Plan
- [ ] Verify documentation shows correct 90 threshold
- [ ] Trigger a workflow and verify retry works on CLI failure
- [ ] Create a test spec and verify issue stays open after merge
- [ ] Verify issue closes when all 9 libraries have
`impl:{library}:done`
MarkusNeusinger added a commit that referenced this pull request Dec 17, 2025
- Fix #1: Handle missing remote branches in impl-generate.yml
  - Check if branch exists before checkout to avoid 'not a commit' errors
  - Fall back to creating fresh branch from main if remote doesn't exist

- Fix #2: Clean up duplicate labels on failure
  - Remove both 'generate:X' and 'impl:X:pending' when marking as failed
  - Prevents label accumulation (e.g., both pending and failed)

- Fix #3: Auto-close issues when done + failed = 9
  - Previously only closed when all 9 were 'done'
  - Now closes when total (done + failed) reaches 9
  - Shows which libraries failed in closing comment

- Fix #4: Track generation failures and auto-mark as failed
  - Count previous failed runs for same spec/library
  - After 3 failures, mark as 'impl:X:failed' automatically
  - Posts failure comment explaining the library may not support this plot type
MarkusNeusinger added a commit that referenced this pull request Dec 17, 2025
…1308)

## Summary

Fixes several workflow issues discovered during batch processing of
spec-ready issues.

### Fix #1: Branch-Not-Found Errors
**Problem:** `fatal: 'origin/implementation/{spec}/{library}' is not a
commit` errors when workflow tries to checkout a non-existent remote
branch.

**Solution:** Check if remote branch exists before checkout, fall back
to creating fresh branch from main.

### Fix #2: Duplicate Labels
**Problem:** Issues accumulate both `impl:X:pending` and `impl:X:failed`
labels when generation fails.

**Solution:** Failure handler removes both `generate:X` and
`impl:X:pending` when marking as failed.

### Fix #3: Auto-Close with Failures
**Problem:** Issues with 8 done + 1 failed stay OPEN because auto-close
only triggers on 9 done.

**Solution:** Close when `done + failed = 9`, with appropriate summary
(shows which libraries failed).

### Fix #4: Generation Failure Tracking
**Problem:** When `impl-generate` fails (no plot.png), no PR is created
→ no review → no repair → library stays `pending` forever.

**Solution:** Track generation failures and mark as `impl:X:failed`
after 3 consecutive failures. Posts comment explaining the library may
not support this plot type.

## Files Changed
- `.github/workflows/impl-generate.yml` - Fixes #1, #2, #4
- `.github/workflows/impl-merge.yml` - Fix #3

## Testing
- YAML syntax validated
- Logic reviewed against observed failure patterns
MarkusNeusinger added a commit that referenced this pull request Jan 26, 2026
Critical fixes:
- Fix TypeError: flatten tag dict to list for repository call (#6)
  Changed search_by_tags to receive list[str] instead of dict[str, list[str]]
  Repository expects flat list of tag values, not nested dict

- Add missing dataprep and styling parameters (#1, #7)
  Added to search_specs_by_tags function signature and docstring
  These categories were documented but not implemented

- Add filter logic for dataprep and styling (#2)
  Implemented filtering checks similar to other impl-level tags
  Ensures new parameters actually filter results

- Update condition to include dataprep and styling (#3)
  Modified impl-level filtering condition on line 117
  Now checks all 6 impl-level tag categories

Improvements:
- Add database error handling with helpful messages (#8)
  Check is_db_configured() before all database operations
  Provides clear error message if DATABASE_URL not set

- Update test mocks to match fixed interface (#5)
  Tests now verify flattened tag list instead of dict
  Added new test for dataprep/styling filter parameters
  Mock is_db_configured to return True in test fixture

Verification:
- All 16 unit tests passing
- Ruff linting and formatting applied
- No routing conflicts (#4 verified - no /mcp routes in routers)

Related: PR #4132, Issue #4129